arXiv:1502.05067v1 [cs.SI] 17 Feb 2015


February 19, 2015 2:0 WSPC/INSTRUCTION FILE 


sgs 


Advances in Complex Systems 
© World Scientific Publishing Company 


NODE MIXING AND GROUP STRUCTURE OF 
COMPLEX SOFTWARE NETWORKS 


LOVRO SUBELJ 

University of Ljubljana, Faculty of Computer and Information Science, 
Trzaska cesta 25, SI-1001 Ljubljana, Slovenia 
lovro.subelj@fri.uni-lj.si 

SLAVKO ZITNIK 

University of Ljubljana, Faculty of Computer and Information Science, 
Trzaska cesta 25, SI-1001 Ljubljana, Slovenia 
slavko.zitnik@fri.uni-lj.si 

NELI BLAGUS 

University of Ljubljana, Faculty of Computer and Information Science, 
Trzaska cesta 25, SI-1001 Ljubljana, Slovenia 
neli.blagus@fri.uni-lj.si

MARKO BAJEC 

University of Ljubljana, Faculty of Computer and Information Science, 
Trzaska cesta 25, SI-1001 Ljubljana, Slovenia 
marko.bajec@fri.uni-lj.si

Received (received date) 

Revised (revised date) 

Accepted (day month year) 

Communicated by (xxxxxxxxxx) 


Large software projects are among the most sophisticated human-made systems, consisting of
a network of interdependent parts. Past studies of software systems from the perspective
of complex networks have already led to notable discoveries with different applications.
Nevertheless, our comprehension of the structure of software networks remains
only partial. Here we investigate correlations, or mixing, between linked nodes and show
that software networks reveal dichotomous node degree mixing similar to that recently
observed in biological networks. We further show that software networks also reveal
characteristic clustering profiles and mixing. Hence, node mixing in software networks
differs significantly from that in, e.g., the Internet or social networks. We explain the
observed mixing through the presence of groups of nodes with a common linking pattern.
More precisely, besides densely linked groups known as communities, software networks
also consist of disconnected groups denoted modules, core/periphery structures
and others. Moreover, groups coincide with the intrinsic properties of the underlying
software projects, which promotes practical applications in software engineering.

Keywords: Software networks; node mixing; node groups; software engineering. 



1. Introduction 

Large software projects are among the most sophisticated and diverse human-made
systems; still, our comprehension of their complex structure and behavior remains
only partial [5]. On the other hand, studies on modeling software systems
as networks of interdependent parts have recently led to some notable discoveries
and promoted different applications. Complex networks possibly provide
the most adequate framework for the analysis of large software systems developed
according to object-oriented, structured programming and other paradigms.

Past studies have already shown that software systems modeled as directed networks
are scale-free [2] with a power-law in-degree distribution and, e.g., an exponential
out-degree distribution. Furthermore, the networks are small-world when
represented with undirected graphs, and reveal a hierarchical [55] and fractal
structure. The latter can be, similarly to the properties mentioned above,
related to code complexity or reusability and the quality of the underlying software
projects. Authors have also proposed different growing models of software
networks and investigated the importance of particular nodes in the
networks [23], their evolution during project execution [5], practical applications of
network community and motif structure [54, 46], and others.

In the present paper, we first analyze the correlations, or mixing, between
linked nodes in software networks, which has not yet been addressed properly.
Despite a common belief that software networks are negatively correlated or
disassortative by degree as, e.g., web graphs or the Internet [38], we show that the
networks are indeed strongly disassortative by in-degree, but much more positively
correlated or assortative by out-degree, otherwise a characteristic property of
different social networks [36]. Software networks thus reveal dichotomous degree mixing,
similar to that recently detected in undirected biological networks [19].

We further show that software networks are characterized by a sickle-shaped
clustering [57] profile also observed in [19]. This unique shape is retained in the
case of degree-corrected clustering [43], whereas the structure of the networks differs
significantly from that of the Internet or a social network. More precisely, software
networks contain connected parts or regions with very low or very high
degree-corrected clustering (Figure 1), whereas only one of the two extremes is otherwise
observed in either the Internet or a social network. Nevertheless, all types of networks
reveal clear degree-corrected clustering assortativity that has not been reported in
the literature before.

We explain the observed degree mixing and clustering assortativity through the
presence of different types of groups or clusters of nodes with a common linking
pattern [35]. Besides densely linked groups denoted communities, software networks
also consist of groups of structurally equivalent nodes denoted modules [48], and
different mixtures of these, with core/periphery and hub & spokes structures as
special cases. We stress that the existence of different types of groups implies high
clustering assortativity, with sparse module-like groups occupying regions with very
low clustering and dense community-like groups occupying regions with higher clustering.
While the former explain the observed disassortativity by degree, the latter in fact
promote the assortativity in the out-degree. Note that the conclusions are consistent
with the results obtained for the Internet and a social network, where mostly
module-like or community-like groups are found, respectively.

Fig. 1. Software dependency network representing the Lucene search engine. (Nodes with
degree-corrected clustering [43] above or below the mean are shown as circles and
triangles, respectively.)

Although the main purpose of the analysis of node mixing is to relate the characteristic
group structure to the existing network properties, the dichotomous degree
mixing in fact implies many of the common properties of real-world networks
(e.g., robustness). The latter, together with the observed node clustering
assortativity, might be of independent interest in network model design and elsewhere.

The paper does not provide a clear rationale behind the existence of different 
types of groups in software networks. Nevertheless, the revealed groups are found 
to closely coincide with some of the intrinsic properties of the underlying software 
projects. The paper thus also includes preliminary work and results of selected 
applications of network group detection in software engineering. 

The rest of the paper is structured as follows. For the analysis in the paper, we
adopt software dependency networks based on [33, 47], which are introduced in
Section 2. Next, Section 3 contains an extensive empirical analysis and formal discussion
of node degree and clustering mixing. Analysis of the characteristic groups of nodes
in software networks is conducted in Section 4, while some practical applications of
group detection in software engineering are given in Section 5. Section 6 concludes
the paper and gives prominent directions for future work.

2. Software dependency networks 

Complex software systems can be modeled with various types of networks including
software architecture maps [52], class diagrams [54], inter-package and class
dependency networks [46], class, method and package collaboration graphs,
software mirror [5] and subroutine call graphs, to name just a few. These networks
differ mainly in whether they are constructed from source code, byte code or program
execution traces, in the level of software architecture represented by the
nodes, and in the types of software relationships represented by the links.

    class C extends S implements I {
        F field;
        public C() { ... }
        void foo(P parameter) { ... }
        private R bar() { ... }
    }

Fig. 2. (a) A toy example class written in Java and (b) the corresponding class dependency network.

For consistency with most past work, we consider class dependency networks [46, 47]
that are suitable for modeling object-oriented software systems. Here, nodes
represent software classes and links correspond to different types of dependencies
among them (e.g., inheritance). More formally, let a software project consist of
classes $C = \{C_1, C_2, \ldots\}$. The corresponding class dependency network is a directed
graph $G(V, L)$, where $V = \{1, 2, \ldots, n\}$ is the set of nodes and $L$ is the set of
links, $m = |L|$. Class $C_i$ is represented by a node $i \in V$, while a directed link
$(i, j) \in L$ corresponds to some dependency between classes $C_i$ and $C_j$ (Figure 2).
This can be either an inheritance (i.e., $C_i$ extends class or implements interface $C_j$),
a composition (i.e., $C_i$ contains a field or variable of type $C_j$) or a dependence (i.e.,
$C_i$ contains a constructor, method or function with parameter or return type $C_j$).
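
As a minimal sketch of the construction just described, the following Python fragment turns class signatures into directed links. The input format (a dictionary with the hypothetical keys `inherits`, `fields` and `signature_types`) and the function name are assumptions made for illustration only; they are not taken from the tooling used in the study.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical input format: class name -> dependency kind -> referenced classes.
Signatures = Dict[str, Dict[str, Iterable[str]]]


def dependency_links(signatures: Signatures) -> Set[Tuple[str, str]]:
    """Return directed links (Ci, Cj) for every dependency of Ci on Cj."""
    links: Set[Tuple[str, str]] = set()
    for cls, sig in signatures.items():
        # inheritance, composition and dependence are all treated alike
        for kind in ("inherits", "fields", "signature_types"):
            for target in sig.get(kind, ()):
                if target != cls:  # assumption: self-dependencies are dropped
                    links.add((cls, target))
    return links


# The toy class C of Fig. 2: extends S, implements I, has a field of type F,
# a method parameter of type P and a return type R.
toy = {"C": {"inherits": ["S", "I"], "fields": ["F"], "signature_types": ["P", "R"]}}
links = dependency_links(toy)
```

Storing the links in a set means that repeated dependencies between the same pair of classes collapse into a single link.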

Note that class dependency networks are constructed merely from the signatures
of software classes, and of the fields and functions therein. Thus, the networks address
the inter-class structure of the software systems, whereas the intra-class dependencies
are ignored. However, as such information is often decided by a team of
developers prior to the actual software development, it is not influenced by the
programming style of each individual developer. Moreover, such networks coincide with the
flow of information and also the human comprehension of object-oriented software
systems. Nevertheless, the networks still give only a partial view of the system.

According to the object-oriented programming paradigm, a class that extends
a parent class also inherits all of its functionality (not considering the visibility).
Hence, each class implicitly acquires the dependencies of its parent class, the
parent class of its parent class, and so on. For the analysis in the paper, we thus
first construct the networks based on the explicit class dependencies as described
above, and then also copy the implicit dependencies of each class from its
parent classes. This provides a somewhat more adequate representation of the intrinsic
structure of the software system and also coincides with the developer's view. Note
that the process does not significantly increase the overall number of dependencies
(see below). Finally, we reduce the networks to simple directed graphs, to limit
the influence of individual developers as above. The networks thus utilize merely the
connectedness between the nodes, while disregarding its strength. We consider four
such software dependency networks that are shown in Table 1 (see also Figure 1).
All selected networks represent well-known software projects developed in Java,
including physics simulation, scientific computing and network analysis libraries.
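
The copying of implicit dependencies can be sketched as follows. This is an illustrative reading of the procedure, assuming single inheritance through a hypothetical `parent` map; the function name and data layout are not from the original study.

```python
from typing import Dict, Optional, Set


def add_inherited(explicit: Dict[str, Set[str]],
                  parent: Dict[str, Optional[str]]) -> Dict[str, Set[str]]:
    """Copy to each class the explicit dependencies of all its ancestors.

    explicit: class -> classes it explicitly depends on
    parent:   class -> parent class (absent or None for root classes)
    """
    closed = {c: set(deps) for c, deps in explicit.items()}
    for c in explicit:
        seen = {c}
        p = parent.get(c)
        while p is not None and p not in seen:  # walk up the inheritance chain
            closed[c] |= explicit.get(p, set())
            seen.add(p)
            p = parent.get(p)
        closed[c].discard(c)  # keep the graph simple: no self-links
    return closed


# A extends B, B extends D; B depends on X and D depends on Y.
explicit = {"A": {"B"}, "B": {"D", "X"}, "D": {"Y"}}
parent = {"A": "B", "B": "D"}
closed = add_inherited(explicit, parent)
```

Because dependencies are kept in sets, the result is already a simple directed graph in which only the presence of a dependency is recorded, not its multiplicity.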


Table 1. Software, Internet and social networks used in the study. (The values in brackets
show the number of links corresponding to explicit class dependencies.)

Network        Description                                           n      m      (explicit)
jbullet        JBullet 2.72 game physics simulation toolbox          166    619    (552)
colt           Colt 1.2.0 scientific & technical computing library   227    963    (709)
jung           JUNG 2.0.1 network & graph analysis framework         306    930    (713)
lucene         Lucene 4.1.0 high-performance text search engine      1657   6808   (6252)
internet       Oregon 2003 autonomous systems snapshot [26]          767    1857   -
collaboration  Network scientists collaborations [33]                1589   2742   -

Note: Software networks are reduced to largest connected components.


For a thorough empirical comparison in the following sections, we also consider
two other real-world networks, namely, a snapshot of communications between
autonomous systems of the Internet collected by the University of Oregon in 2003 [26]
and a social network of collaborations between scientists working on network theory
and experiment [33] (Table 1). These are simple undirected networks. Although
some directed social and technological networks would enable a more straightforward
comparison, such networks are commonly either much larger than software networks
or do not reveal a particularly clear group structure. On the other hand, we stress that
the selected networks represent two fundamentally different topologies. While social
networks are characterized by a dense degree assortative structure and
community-like groups [36], the Internet is much sparser and disassortative by degree [38].
Also, the prevalent groups of nodes are module-like, e.g., hub & spokes [25].


3. Node mixing in software networks 

The present section contains an extensive comparative analysis of different networks
according to node degree and clustering mixing. We first review characteristics of
node degree distributions in Section 3.1 and then show that software networks reveal
dichotomous degree mixing in Section 3.2. Next, sickle-shaped clustering profiles of
software networks are explored in Section 3.3, while Section 3.4 provides empirical
evidence of node clustering assortativity in real-world networks.

3.1. Scale-free node degree distributions

Let $k_i$ be the degree of node $i \in V$ and let $\langle k \rangle$ be the mean degree in the network.
For directed networks, the degree is defined as the sum of in-degree and out-degree.
Next, let $\Delta$ be the maximum degree, and $\Delta_{in}$ and $\Delta_{out}$ the maximum in-degree
and out-degree, respectively. Last, let $\gamma$ be the scale-free exponent of the power-law
degree distribution $P(k) \sim k^{-\gamma}$ [2], $\gamma > 1$, and let $\gamma_{in}$ and $\gamma_{out}$ be the exponents
corresponding to in-degree and out-degree distributions, respectively. The values of
$\gamma$-s were estimated by the maximum-likelihood method with goodness-of-fit tests [7].
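
To make the estimation step concrete, the sketch below uses the closed-form approximation to the maximum-likelihood exponent for discrete data that is common in the power-law fitting literature, $\hat{\gamma} \approx 1 + n / \sum_i \ln(k_i / (k_{min} - 1/2))$. Whether the study used exactly this estimator is not stated in the text, so this is an assumption; the selection of $k_{min}$ and the goodness-of-fit test are omitted, and the function name is illustrative.

```python
import math
from typing import Iterable


def powerlaw_exponent(degrees: Iterable[int], k_min: int = 1) -> float:
    """Approximate maximum-likelihood exponent for degrees k >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    # every term is positive because k >= k_min > k_min - 0.5
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# Heavier tails (more large degrees) give a smaller exponent.
gamma_light = powerlaw_exponent([1, 1, 1, 1, 2])
gamma_heavy = powerlaw_exponent([1, 2, 4, 8, 16])
```

In practice the estimate is sensitive to the choice of `k_min`, which is why the text pairs the estimator with goodness-of-fit tests.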

Table 2 describes node degree sequences of different networks. The mean degree
$\langle k \rangle$ is somewhat comparable across software networks and approximately half as
large for internet and collaboration networks. Observe, however, that in the case of
directed software networks the values of $\Delta$-s and $\gamma$-s are obviously governed by
much broader in-degree sequences, compared to relatively suppressed out-degree
sequences (e.g., lucene network). Particularly, as past work has already shown,
software networks have a scale-free in-degree distribution that follows a power-law with
$2 < \gamma_{in} < 3$ [52] and a highly truncated, e.g., log-normal [9] or exponential [53],
out-degree distribution (see Table 2). Note also that the tail of the (in-)degree
distribution of lucene software network is well modeled by the scale-free degree distribution
of a sparse topology of the Internet, while, from the perspective of out-degrees, the
network is somewhat more similar to a dense assortative social network (Figure 3).


Table 2. Node degree sequences of different networks. (The exponents
$\gamma$-s in italics do not represent a valid fit to a power-law [?].)

Network        $\langle k \rangle$   $\Delta$   $\Delta_{in}$   $\Delta_{out}$   $\gamma$   $\gamma_{in}$   $\gamma_{out}$
jbullet        7.46    62     62     22     2.80   2.26   4.04
colt           8.48    140    140    13     2.56   2.56   3.91
jung           6.08    95     92     12     2.65   2.77   4.47
lucene         8.22    337    333    20     2.24   2.14   4.91
internet       4.68    303    -      -      2.28   -      -
collaboration  3.45    34     -      -      2.85   -      -


For the software dependency networks concerned, in-degree and out-degree
sequences have a rather clear meaning in software engineering. The out-degree of node
$i$ corresponds to the number of classes required to implement the functionality of
class $C_i$ and is thus a measure of 'external' complexity. Indeed, different
software quality metrics are based on the out-degrees of nodes in software networks [6].
On the other hand, the in-degree of node $i$ corresponds to the number of classes
that depend on or use class $C_i$ and is related to the level of code reusability.
Highly reused classes are, obviously, well known among developers and are thus
also more commonly used in the future. The latter is exactly the principle behind the
preferential attachment model [2], which produces a power-law in-degree distribution
in software dependency networks. For the case of the out-degree distribution, the
long scale-free tail is suppressed by constant incremental refactoring of classes within
a growing software project [3] (to reduce its complexity), while such a distribution
also results from a certain class of software duplication mechanisms [53].

Fig. 3. Node degree distributions of larger networks (panels: lucene, internet, collaboration;
see also Table 2). Note that lucene software network reveals a scale-free (in-)degree
distribution as the Internet and a truncated, e.g., log-normal or exponential, out-degree
distribution more similar to the collaboration network.


3.2. Dichotomous node degree mixing 


The most straightforward way to analyze node degree mixing in general networks is
to measure the mixing coefficient $r$, which is defined as a Pearson correlation
coefficient of degrees at links' ends, $r \in [-1, 1]$. Hence,

    r = \frac{1}{2 m \sigma_k^2} \sum_{(i,j) \in L} (k_i - \langle k \rangle)(k_j - \langle k \rangle),    (1)

where $\sigma_k$ is the standard deviation, i.e., $\sigma_k = \sqrt{\frac{1}{2m} \sum_{(i,j) \in L} (k_i - \langle k \rangle)^2}$.
Assortative mixing by degree shows as a positive correlation $r > 0$, while disassortative
degree mixing refers to a negative correlation $r < 0$. For the case of directed networks, one
can similarly define four additional coefficients $r_{(\alpha,\beta)}$, $\alpha, \beta \in \{in, out\}$, where
$\alpha$ and $\beta$ correspond to the types of degrees of links' source and target nodes, respectively.
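
A minimal sketch of both kinds of coefficients is given below. It computes a plain Pearson correlation over the links' ends, so the means are taken over link ends rather than over nodes; this is the usual convention for assortativity but an assumption with respect to the exact normalization used in the paper. All function names are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Link = Tuple[int, int]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; NaN when either sequence has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(vx * vy)


def directed_mixing(links: Iterable[Link], alpha: str, beta: str) -> float:
    """r_(alpha,beta): alpha-degree of sources against beta-degree of targets."""
    links = list(links)
    deg = {"out": Counter(i for i, _ in links), "in": Counter(j for _, j in links)}
    xs: List[float] = [deg[alpha][i] for i, _ in links]
    ys: List[float] = [deg[beta][j] for _, j in links]
    return pearson(xs, ys)


def undirected_mixing(links: Iterable[Link]) -> float:
    """r for total degree (in + out), each link counted in both directions."""
    links = list(links)
    k = Counter()
    for i, j in links:
        k[i] += 1
        k[j] += 1
    xs = [k[i] for i, j in links] + [k[j] for i, j in links]
    ys = [k[j] for i, j in links] + [k[i] for i, j in links]
    return pearson(xs, ys)


# A star whose leaves all depend on one hub is perfectly disassortative.
star = [(1, 0), (2, 0), (3, 0)]
r_star = undirected_mixing(star)
```

For the star, every link joins a node of degree 1 to a node of degree 3, which is the extreme case of the hub-to-leaf linking that drives negative values of $r$.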

Table 3 summarizes degree mixing in different networks. As already stated before,
social networks reveal strong assortative mixing (e.g., collaboration network),
whereas the Internet is degree disassortative [38]. Software networks also appear
to be disassortative by degree according to $r$. Nevertheless, this is actually
a consequence of the prevailing in-degree sequences (see Section 3.1). The networks
are indeed highly disassortative by in-degree, $r_{(in,in)} \ll 0$, though much more
assortative by out-degree in most cases, $r_{(out,out)} \gg r_{(in,in)}$ (e.g., lucene network).
Expectedly, $r_{(in,out)}$ reveals no clear mixing regime, $r_{(in,out)} \approx 0$, while $r_{(out,in)}$
is again governed by the dominant in-degrees.

Table 3. Node degree mixing coefficients $r$ of different networks.

Network        $r$     $r_{(in,in)}$   $r_{(in,out)}$   $r_{(out,in)}$   $r_{(out,out)}$
jbullet        -0.21   -0.29   -0.07   -0.26   -0.14
colt           -0.24   -0.27   -0.06   -0.25   -0.28
jung           -0.22   -0.25   -0.05   -0.24   -0.13
lucene         -0.28   -0.30    0.00   -0.29   -0.04
internet       -0.26   -       -       -       -
collaboration   0.46   -       -       -       -

Fig. 4. Neighbor connectivity plots [38] of larger networks (panels: lucene, internet,
collaboration; horizontal axis: node degree $k$, with in-degree $k_{in}$ and out-degree $k_{out}$
marked separately; see also Table 3). Note that lucene software network reveals dichotomous
degree mixing that is disassortative by in-degree as the Internet and assortative by
out-degree as social networks (e.g., collaboration network).

Note that the above coefficients provide a rather limited global view of degree mixing
and can capture merely linear correlations. Figure 4 therefore also shows neighbor
connectivity plots [38] that display the mean neighbor degree against node degree $k$. Here,
assortative or disassortative mixing is reflected in an increasing or decreasing trend,
respectively. While the software network is clearly disassortative by in-degree, it is
in fact slightly assortative by out-degree, as in the case of a social network.
Furthermore, the degrees $k$ show a clear two-phase or dichotomous mixing that is controlled
by out-degrees for smaller $k$, and by in-degrees as $k$ increases. Although one can
also observe some dichotomous behavior for collaboration and internet networks,
this does not appear significant and can be due to the size of the networks. Thus, as
previously claimed, software networks reveal dichotomous degree mixing and differ
from other degree disassortative networks like web graphs and the Internet.
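
The quantity behind a neighbor connectivity plot can be sketched as follows for the undirected case: for every degree $k$, average the mean neighbor degree over all nodes of that degree. The function name is illustrative, and the directed in-degree and out-degree variants discussed above are left out of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def neighbor_connectivity(links: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Map each degree k to the mean neighbor degree of nodes with degree k."""
    nbrs: Dict[int, Set[int]] = defaultdict(set)
    for i, j in links:
        if i != j:  # ignore self-links, treat links as undirected
            nbrs[i].add(j)
            nbrs[j].add(i)
    by_k = defaultdict(list)
    for node, ns in nbrs.items():
        mean_nbr_degree = sum(len(nbrs[v]) for v in ns) / len(ns)
        by_k[len(ns)].append(mean_nbr_degree)
    return {k: sum(vals) / len(vals) for k, vals in sorted(by_k.items())}


# Star with three leaves: leaves (k = 1) see the hub, the hub (k = 3) sees leaves.
knn = neighbor_connectivity([(1, 0), (2, 0), (3, 0)])
```

A decreasing map, as for the star, corresponds to disassortative mixing, while an increasing one indicates assortative mixing.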

It ought to be mentioned that similar observations were recently made also in
undirected biological networks [19]. Although these are disassortative by degree [29],
removing a certain percentage of high degree nodes or hubs [18] renders the networks
degree assortative. Since hubs in software networks correspond to nodes with high
in-degree (see Table 2), our work generalizes that in [19] to directed networks.

Dichotomous degree mixing in software networks can be seen as a product of
different programming paradigms. Recall that the out-degree of a node measures the
complexity of the corresponding software class, whereas its in-degree is related to
class reuse (see Section 3.1). Disassortativity in the in-degrees can be interpreted as
a low probability that hubs link to one another; thus, highly reused classes tend not to
depend on each other. Since these commonly implement rather different functionality, the
latter is in fact a result of the minimum-coupling and maximum-cohesion principle [45]. On the
other hand, object-oriented software systems are commonly developed according
to the Lego hypothesis [3], where smaller and simpler classes are used to implement
larger and more complex ones, and so on. As this results in an entire hierarchy of
classes with increasing complexity across the levels of the hierarchy, a class depends
only on classes with rather similar complexity, i.e., classes from the previous level.
This implies assortativity in the out-degrees in software networks.

3.3. Sickle-shaped node clustering profiles 

Besides the degree distributions and mixing considered above, real-world networks are
commonly assessed by their transitivity. For simple undirected graphs, this can
be measured by the node clustering coefficient $c$ [57], $c \in [0, 1]$, defined as

    c_i = \frac{t_i}{\binom{k_i}{2}},    (2)

where $t_i$ is the number of links between the neighbors of node $i \in V$ and $\binom{k_i}{2}$
is the maximal number of links ($c_i = 0$ for $k_i \leq 1$). Note that the denominator
in Eq. (2) introduces biases in the definition, since $\binom{k_i}{2}$ often cannot be reached due
to a fixed degree sequence [43] (see below). Thus, an alternative definition of node
degree-corrected clustering coefficient $d$ [43], $d \in [0, 1]$, has been proposed as

    d_i = \frac{t_i}{\omega_i},    (3)

where $\omega_i$ is the maximal possible number of links between the neighbors of node $i$
with respect to their degrees ($d_i = 0$ for $k_i \leq 1$). Since $\omega_i \leq \binom{k_i}{2}$, $d_i \geq c_i$ by definition.
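
The standard clustering coefficient can be sketched directly from its definition, as the number of links among a node's neighbors divided by the number of neighbor pairs. The degree-corrected variant is deliberately not implemented here, because computing the maximal possible number of links among neighbors requires the procedure of [43], which the text does not spell out. The adjacency-dictionary format and function name are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def clustering(adj: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, float]:
    """Node clustering coefficient for a simple undirected graph.

    adj maps every node to the set of its neighbors (symmetric, no self-links).
    """
    c = {}
    for i, ns in adj.items():
        k = len(ns)
        if k <= 1:
            c[i] = 0.0  # convention: zero for nodes with fewer than two neighbors
            continue
        # count linked pairs among the neighbors of i
        t = sum(1 for u, v in combinations(ns, 2) if v in adj[u])
        c[i] = 2.0 * t / (k * (k - 1))
    return c


# A triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
c = clustering(adj)
```

Node 3 illustrates the degree bias discussed above: the pendant neighbor can never link to the other two neighbors, yet it still counts in the denominator.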

Table 4 shows the mean node (degree-corrected) clustering $\langle c \rangle$ and $\langle d \rangle$ in
different networks. As these are small-world, $\langle c \rangle$ and $\langle d \rangle$ are considerably larger
than the expected clustering coefficient $p$ in a corresponding random graph,
$p = \langle k \rangle / (n - 1)$. The structure of collaboration network reveals the most
densely linked neighborhoods, where the majority of nodes have $d$ equal to one
(see Table 4). Exactly the opposite holds for internet network, where $d$ is close
to zero, $d < p$, in most cases. On the other hand, software networks are again
characterized by an interplay between the dense structure of social networks and
the sparse topology of the Internet. Most of the nodes have moderate values of $d$,
$p < d < 1$, whereas nodes with either very low or high $d$ are concentrated in certain
parts of the networks (not shown).

Table 4. Node clustering coefficients of different networks.

Network        $\langle c \rangle$   $\langle d \rangle$   $d = 1$ (% nodes)   $d < p$ (% nodes)
jbullet        0.43   0.50   9%     20%
colt           0.50   0.58   17%    13%
jung           0.51   0.58   19%    19%
lucene         0.50   0.55   11%    13%
internet       0.29   0.32   21%    55%
collaboration  0.64   0.69   61%    28%

Note: Networks are reduced to simple undirected graphs.

Fig. 5. Node clustering [57] profiles of larger networks (panels: lucene, internet,
collaboration; see also Table 4). Note the degree biases introduced in the standard
definition of clustering that imply low values for hubs, which is particularly
apparent in degree disassortative networks (e.g., lucene and internet networks).

We next consider node (degree-corrected) clustering profiles shown in Figure 5
and Figure 6. One can observe degree biases in the standard definition of clustering $c$
that imply low $c$ for hubs (see Eq. (2)), particularly apparent in degree disassortative
networks (see Figure 5). More precisely, $c$ decreases rapidly with $k$, roughly following
a power-law form in the case of the Internet [56, 43]. Note that these biases
are absent from the degree-corrected definition of clustering $d$ (see Figure 6), which
thus provides a somewhat more adequate measure of network transitivity.

Notice also the very peculiar sickle-shaped (degree-corrected) clustering profiles
revealed for the software network (see, e.g., Figure 6). This unique form is most
notably pronounced in the case of out-degrees and is, at least in the undirected case, an
artifact of dichotomous node degree mixing [19]. On the contrary, profiles of internet
and collaboration networks show no particular scaling for degree-corrected clustering
$d$ (see Figure 6), consistent with the analysis of node degree mixing in Section 3.2.
Nevertheless, all networks considered here reveal clear degree-corrected clustering
assortativity, which is thoroughly investigated in the following section.

As before, (degree-corrected) clustering profiles in software networks can be
related to the intrinsic properties of the underlying software systems [46, 47]. While
nodes that represent core classes of a software project commonly group together into
dense neighborhoods with high clustering, nodes with lower clustering most often
correspond to different implementations of the same functionality (see Figure 14).

Fig. 6. Node degree-corrected clustering [43] profiles of larger networks (panels: lucene,
internet, collaboration; see also Table 4). Note that lucene software network reveals a
sickle-shaped clustering profile most notably pronounced for out-degrees, which is absent
in the case of the Internet and the collaboration network.


3.4. Node degree-corrected clustering assortativity 

The present section explores node (degree-corrected) clustering mixing in different
networks. For this purpose, we define the clustering mixing coefficient $r_c$, $r_c \in [-1, 1]$, as

    r_c = \frac{1}{2 m \sigma_c^2} \sum_{(i,j) \in L} (c_i - \langle c \rangle)(c_j - \langle c \rangle),    (4)

and similarly $r_d$ for the degree-corrected clustering coefficient, where $\sigma_c$ is the
standard deviation defined analogously to $\sigma_k$ in Eq. (1). $r_c$ and $r_d$ are again just
Pearson correlation coefficients of (degree-corrected) clustering at links' ends and
are shown in Table 5. Due to degree biases in $c$ (see Section 3.3), $r_c > 0$ in degree
assortative networks (e.g., collaboration network), while $r_c < 0$ for networks that are
disassortative by degree (e.g., lucene network). On the other hand, all networks show
clear degree-corrected clustering assortativity with $r_d \gg 0$ (see also Figure 7). Note
also that correlations reflected in $r_d$ are much stronger than in the case of degree
mixing coefficients $r$-s (see Table 3). To the best of our knowledge, this distinctive
property of real-world networks has not yet been reported in the literature.
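
The same correlation can be sketched for any node-level quantity, which covers both the standard and the degree-corrected clustering once their values are available. The sketch below is a plain Pearson coefficient over the two ends of every link, with each link counted in both directions; this symmetrization and the function name are assumptions for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple


def attribute_mixing(links: Iterable[Tuple[Hashable, Hashable]],
                     value: Dict[Hashable, float]) -> float:
    """Pearson correlation of a node attribute at the two ends of each link."""
    xs, ys = [], []
    for i, j in links:
        # count every link in both directions so the measure is symmetric
        xs += [value[i], value[j]]
        ys += [value[j], value[i]]
    if not xs:
        return float("nan")
    mean = sum(xs) / len(xs)  # identical for xs and ys by symmetry
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return float("nan")
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    return cov / var


# Two links, each joining nodes with equal attribute values: fully assortative.
r_same = attribute_mixing([(1, 2), (3, 4)], {1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0})
# One link joining nodes with different values: fully disassortative.
r_diff = attribute_mixing([(1, 2)], {1: 0.0, 2: 1.0})
```

Passing the clustering values of the nodes as `value` gives a coefficient of the kind reported for clustering mixing, up to the normalization convention noted above.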


Table 5. Node clustering mixing coefficients of different networks.

Network        $r_c$   $r_d$
jbullet        -0.06   0.50
colt           -0.26   0.35
jung           -0.07   0.33
lucene         -0.40   0.50
internet       -0.23   0.26
collaboration   0.44   0.68

Note: Networks are reduced to simple undirected graphs.

Fig. 7. Neighbor (degree-corrected) clustering plots of larger networks (horizontal axis:
node (degree-corrected) clustering; see also Table 5). Note that all networks reveal a clear
degree-corrected clustering [43] assortativity (e.g., lucene network), which is absent
from the standard definition of clustering [57] (e.g., internet network).


According to Section 3.3, nodes in software networks have very different values
of degree-corrected clustering $d$, which is not true for social networks or the Internet.
Together with strong assortativity $r_d \gg 0$, this in fact implies entire connected parts
or regions of nodes with rather similar $d$ (e.g., very low or high). The latter can be
clearly seen in Figure 1. In the following section, we explain degree-corrected
clustering assortativity, and the dichotomous degree mixing observed in Section 3.2,
through the presence of characteristic groups of nodes with a common linking
pattern [35]. More precisely, dense community-like groups occupy network regions with
higher $d$ and imply degree assortativity, while sparse module-like groups are found
in regions with lower $d$ and are responsible for degree disassortativity.


4. Group structure of software networks 


Node group structure of different networks is explored using a principled group
extraction framework based on [49, 59]. The present section thus first introduces the
framework and corresponding formalisms in Section 4.1, while Section 4.2 reports
the characteristic group structure revealed in software and other networks. Last,
Section 4.3 relates different types of groups to degree and clustering mixing observed
in Section 3, which uniquely characterizes the structure of these networks.


4.1. Node group extraction framework 

The formalism proposed in [49] defines network groups for the case of simple
undirected graphs. Let $S$ be a group of nodes and $T$ a subset of nodes representing its
characteristic linking pattern, $S, T \subseteq V$. Also, let $s = |S|$ and $t = |T|$. The node
pattern $T$ is defined so as to maximize the number of links between $S$ and $T$, and
minimize the number of links between $S$ and $T^C$, while disregarding the links with
both endpoints in $S^C$, where the superscript $C$ denotes the complement in $V$. Note that
this simple formalism allows one to derive most types of groups commonly analyzed in
the literature (Figure 8).

(a) Community (b) Core/periph. (c) Mixture (d) Module (e) Hub & spokes

Fig. 8. Toy examples of different types of groups of nodes in real-world networks (see also text).
(Groups S and corresponding patterns T are shown with filled and marked nodes, respectively.)

For instance, communities, i.e., densely linked groups of nodes that are only
sparsely linked to each other, are characterized by $S = T$. On the other hand,
$S \cap T = \emptyset$ corresponds to groups of structurally equivalent [28] nodes denoted
modules [48]. Communities and modules represent two extreme cases, with all other groups
being the mixtures of the two [49]. For the analysis in the paper, we thus distinguish
between three types of groups according to the following definitions.

Definition 1. Community is a group of nodes $S$ with $S = T$.

Definition 2. Module is a group of nodes $S$ with $S \cap T = \emptyset$.

Definition 3. Mixture is a group of nodes $S$ with $S \cap T \subset S, T$.

All these groups have been extensively analyzed in the past [42, 40].
Clear communities appear in different social and information networks,
while modules are most commonly found in the case of the Internet, biological and
technological networks [39, 48]. For consistency, we also consider two special cases.

Definition 4. Core/periphery structure is a mixture $S$ with either $S \subset T$ or $T \subset S$.

Definition 5. Hub & spokes structure is a module $S$ with $t = 1$.


According to the above definitions, one can in fact determine the type of some
group $S$ by considering the Jaccard index [22] of $S$ and $T$. We thus define a
group type parameter $\tau$ [49], $\tau \in [0,1]$, as

$$\tau(S,T) = \frac{|S \cap T|}{|S \cup T|}. \tag{5}$$

Communities have $\tau = 1$, whereas modules are indicated by $\tau = 0$. Mixtures
correspond to groups with $0 < \tau < 1$. For the remainder of the paper, we refer
to groups with $\tau \approx 1$ or $\tau \approx 0$ as community-like and module-like
groups, respectively.
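
The group type parameter and Definitions 1-5 can be illustrated with a short sketch. The function names `group_type` and `classify` are hypothetical and chosen only for this example; the code merely restates the Jaccard index of $S$ and $T$ and the set relations given in the definitions.

```python
def group_type(S, T):
    """Jaccard index of group S and pattern T (the group type parameter)."""
    S, T = set(S), set(T)
    union = S | T
    if not union:
        raise ValueError("S and T must not both be empty")
    return len(S & T) / len(union)


def classify(S, T):
    """Map a group S with pattern T to one of the types in Definitions 1-5."""
    S, T = set(S), set(T)
    tau = group_type(S, T)
    if tau == 1.0:                       # S = T
        return "community"
    if tau == 0.0:                       # S and T are disjoint
        return "hub & spokes" if len(T) == 1 else "module"
    if S < T or T < S:                   # one set strictly contains the other
        return "core/periphery"
    return "mixture"
```

For example, `classify({1, 2}, {1, 2, 3})` yields a core/periphery structure, since $S$ is strictly contained in $T$.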

The framework presented below is based on a group criterion $W$ [49], $W \in [0,1]$,

$$W(S,T) = \mu(S,T)\,\bigl(1 - \mu(S,T)\bigr) \left( \frac{L(S,T)}{st} - \frac{L(S,T^C)}{s(n-t)} \right), \tag{6}$$

where $L(S,T)$ is the number of links between $S$ and $T$, i.e., the number of links
with $i \in S$ and $j \in T$, $n$ is the number of nodes, and $\mu(S,T)$ is the
harmonic mean of $s$ and $t$ normalized by the number of nodes, $\mu \in [0,1]$,

$$\mu(S,T) = \frac{2st}{n(s+t)}. \tag{7}$$

Notice that $W$ is an asymmetric criterion that favors the links between $S$ and $T$,
and penalizes the links between $S$ and $T^C$. Since the links with both endpoints
in $S^C$ are not considered, $W$ is also a local criterion. We stress that, at least
for the case $S = T$, criterion $W$ has a natural interpretation in a wide class of
generative graph models [59] (e.g., block models [58]). Factor $\mu(1 - \mu)$ in
Eq. (6) prevents the extraction of either very small or very large groups with,
e.g., $s = 1$.
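
As a minimal sketch of how the criterion could be evaluated, the code below assumes the form $W(S,T) = \mu(1-\mu)\,[L(S,T)/(st) - L(S,T^C)/(s(n-t))]$ with $\mu = 2st/(n(s+t))$. The graph representation (a dictionary mapping each node to the set of its neighbors), the function names, and the convention of counting adjacent pairs $(i, j)$ with $i \in S$, $j \in T$ are assumptions made for illustration only.

```python
def links_between(adj, S, T):
    """Count adjacent pairs (i, j) with i in S and j in T."""
    return sum(1 for i in S for j in adj[i] if j in T)


def mu(s, t, n):
    """Harmonic mean of s and t, normalized by the number of nodes n."""
    return 2.0 * s * t / (n * (s + t))


def criterion_w(adj, S, T):
    """Group criterion W for group S and pattern T (assumed form, see text)."""
    S, T = set(S), set(T)
    n, s, t = len(adj), len(S), len(T)
    if s == 0 or t == 0 or t == n:
        return 0.0                       # degenerate cases carry no signal
    complement = set(adj) - T
    m = mu(s, t, n)
    inside = links_between(adj, S, T) / (s * t)
    outside = links_between(adj, S, complement) / (s * (n - t))
    return m * (1.0 - m) * (inside - outside)


# Toy graph: two triangles joined by the single link 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
w_community = criterion_w(adj, {0, 1, 2}, {0, 1, 2})
w_mismatch = criterion_w(adj, {0, 1, 2}, {3, 4, 5})
```

In this toy graph the triangle with $S = T$ scores higher than the same group paired with the other triangle as its pattern, which matches the intended asymmetry of the criterion.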

We next present the adopted group extraction framework [49, 59]. The framework
extracts groups from the network sequentially, one by one, as follows. First,
one finds group $S$ and its corresponding pattern $T$ that maximize criterion $W$
using, e.g., tabu search with varying initial conditions for $S$ and $T$. At each
step of the search, a single node is swapped in either $S$ or $T$. Next, to extract
the revealed group $S$ from the network, one removes merely the links between $S$
and $T$, and any node that might thus become isolated. The entire procedure is then
repeated on the remaining network as long as criterion $W$ is larger than the value
expected under the same framework in a corresponding Erdős-Rényi random graph. The
latter is estimated by a simulation; thus, all groups reported in the remainder of
the paper are statistically significant at the 1% level (see [59] for further details).
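
The sequential procedure can be sketched as follows. This is not the authors' implementation: plain hill climbing over single-node swaps stands in for tabu search, the threshold `w_min` is a hypothetical parameter standing in for the value estimated on Erdős-Rényi random graphs, and the form of $W$ is the one assumed from Eqs. (6) and (7).

```python
import random


def w(adj, S, T):
    """Group criterion W (assumed form) on a dict node -> set of neighbors."""
    n, s, t = len(adj), len(S), len(T)
    if s == 0 or t == 0 or t == n:
        return 0.0
    l_in = sum(1 for i in S for j in adj[i] if j in T)
    l_out = sum(1 for i in S for j in adj[i] if j not in T)
    m = 2.0 * s * t / (n * (s + t))
    return m * (1 - m) * (l_in / (s * t) - l_out / (s * (n - t)))


def toggle(group, v):
    """Swap node v in or out of the given set."""
    if v in group:
        group.remove(v)
    else:
        group.add(v)


def local_search(adj, rng, restarts=10):
    """Hill climbing with single-node swaps in S or T (stand-in for tabu search)."""
    nodes = sorted(adj)
    best = (0.0, frozenset(), frozenset())
    for _ in range(restarts):
        seed = rng.choice(nodes)
        S, T = {seed}, set(adj[seed]) or {seed}
        score, improved = w(adj, S, T), True
        while improved:
            improved = False
            for group in (S, T):
                for v in nodes:
                    toggle(group, v)
                    new = w(adj, S, T)
                    if new > score + 1e-12:
                        score, improved = new, True
                    else:
                        toggle(group, v)     # undo a swap that did not help
        if score > best[0]:
            best = (score, frozenset(S), frozenset(T))
    return best


def extract_groups(adj, w_min, seed=0, max_groups=100):
    """Extract (S, T, W) triples one by one while W exceeds the threshold."""
    adj = {v: set(nb) for v, nb in adj.items()}      # work on a copy
    rng, groups = random.Random(seed), []
    while adj and len(groups) < max_groups:
        score, S, T = local_search(adj, rng)
        if score <= max(w_min, 0.0):
            break
        groups.append((S, T, score))
        for i in S:                                  # remove links between S and T
            for j in list(adj[i]):
                if j in T:
                    adj[i].discard(j)
                    adj[j].discard(i)
        adj = {v: nb for v, nb in adj.items() if nb}  # drop isolated nodes
    return groups


toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
found = extract_groups(toy, w_min=0.05)
```

Only the links between $S$ and $T$ are removed after each extraction, so nodes remain available for later groups, which is what permits overlapping groups in this scheme.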

Note that the framework allows for overlapping, hierarchical, nested
and other classes of groups commonly found in real-world networks. Nevertheless,
it explicitly guards against extracting groups that are not statistically significant.
We refer to the network structure remaining after the extraction as background.


4.2. Characteristic node group structure 

Table 6 summarizes the basic properties of node groups extracted from different
networks. Notice that the mean group size $\langle s \rangle$ is somewhat comparable
across software networks, where a characteristic group consists of around ten nodes.
The mean pattern size $\langle t \rangle$ is slightly smaller, but still comparable
to $\langle s \rangle$ (e.g., jung network).


Table 6. Node groups and corresponding patterns extracted from different networks.
Columns by group type report the number of groups with the mean group size
$\langle s \rangle$ in parentheses.

Network        # Groups  $\langle s \rangle$  $\langle t \rangle$  Community    Core/periphery  Mixture     Module
jbullet           14       9.0      8.4      5 (7.8)      1 (12.0)        6 (12.2)    2 (5.5)
colt              15      10.3      8.3      3 (8.3)      1 (13.0)        9 (12.6)    2 (6.5)
jung              30       8.7      7.8      18 (9.9)     1 (10.0)        5 (9.6)     6 (5.7)
lucene           123      12.1      7.9      55 (8.6)     2 (14.5)        27 (15.7)   39 (14.7)
internet          33      10.6      4.5      1 (4.0)      1 (29.0)        3 (19.0)    28 (9.6)
collaboration    160       5.6      5.6      143 (5.6)    0 (0.0)         12 (6.8)    5 (3.0)

Note: Networks are reduced to simple undirected graphs
Fig. 9. Node group sequence extracted from larger networks (see also Table 6). Note that lucene
software network contains communities, which are commonly found in social networks (e.g.,
collaboration network), modules like the Internet, and also different mixtures of these.
(Panels include lucene and collaboration; horizontal axis: group sequence $S$; legend: random,
community, mixture, module.)


On the other hand, $\langle s \rangle \gg \langle t \rangle$ for the Internet, due to an
abundance of hub & spokes-like modules. Since social networks are characterized by a
pronounced community structure [36], expectedly, $\langle s \rangle \approx \langle t \rangle$
for collaboration network.

By examining the types of the revealed groups (see Table 6), one observes a
very clear distinction between different networks. As already indicated above,
collaboration network consists of almost only communities. On the contrary, 85% of
the groups found in internet network are modules. Software networks, however, are
characterized by communities, modules and different mixtures of these (e.g., lucene
network). Thus, as already argued in the case of node mixing in Section 3, software
networks represent a unique mixture of dense community-like structure of social
networks and sparse module-like topology of the Internet. For a better comprehension,
Figure 14 shows most significant groups extracted from the networks.

Characteristic group structure of different networks is also reflected in the mean
group parameter $\langle \tau \rangle$ (Table 7). Indeed, $\langle \tau \rangle$ is
almost zero or one for internet and collaboration networks, respectively. For software
networks, $\langle \tau \rangle$ is between 0.4 and 0.65, as discussed above. Table 7
reports also the proportion of links explained by the group structure, and the
proportion of nodes included in the groups. Despite the fact that group structure
provides a rather coarse-grained abstraction of a network, the revealed groups explain
80-90% of the links in software and social networks, and almost 60% for the Internet.
Also, groups contain most of the nodes in the networks.

Table 7. Node group structure revealed in different networks (see also Table 6). Note that
characteristic topology of different networks is well characterized by the mean group
parameter $\langle \tau \rangle$. Columns by group type and for the background report
% links (% nodes)^a; a question mark denotes a digit that is illegible in the source.

Network        $\langle \tau \rangle$  Community    Core/periph.  Mixture      Module       Background
jbullet         0.63     15% (22%)    8% (7%)       53% (42%)    6% (7%)      19% (66%)
colt            0.41     7% (11%)     5% (6%)       69% (59%)    4% (6%)      15% (64%)
jung            0.66     62% (51%)    3% (3%)       12% (16%)    10% (11%)    12% (?4%)
lucene          0.55     19% (25%)    1% (2%)       30% (?4%)    38% (?4%)    11% (49%)
internet        0.08     0% (1%)      12% (4%)      13% (7%)     34% (35%)    41% (80%)
collaboration   0.94     71% (47%)    0% (0%)       6% (5%)      1% (1%)      22% (47%)

Note: Networks are reduced to simple undirected graphs
^a Nodes can be included in multiple overlapping groups


As already discussed in Section 3, different types of groups observed in software
networks actually coincide with the intrinsic dynamics of the underlying software
systems. More precisely, core classes of a software project commonly form dense
inheritance hierarchies, while they also provide different convenience methods for
transforming other core classes. Consequently, corresponding nodes in class
dependency networks cluster together and form communities [48] (see Figure 14).
Moreover, software projects commonly consist of classes that represent independent
implementations of the same functionality (e.g., different group detection
algorithms). By definition, these do not depend on each other; however, they do
depend on a similar set of other classes. Hence, corresponding nodes in software
networks aggregate together into module-like groups (see Figure 14). Similarly
as above, mixtures of nodes in software networks are often just an artifact of
different programming principles and practical limitations of software systems.

Notice also the particularly module-like structure of colt network compared to other
software networks (see $\langle \tau \rangle$ in Table 7). Since the network represents
a software library for complex scientific and technical computing, high performance and
scalability are of much greater importance than the system extensibility and future
reusability. While the latter implies a modular design according to the minimum-coupling
and maximum-cohesion paradigm [45] and, consequently, a community-like structure of
software networks [46], the former demands a great deal of code duplication, which
in fact promotes module-like groups in software networks [48]. Equivalently, networks
that correspond to software projects with particularly modular design reveal
more community-like structure (e.g., jung network). Group structure of software
networks thus reflects different programming principles and paradigms followed during
project development, which could be used for software quality control.

Preliminary work on practical applications of network group detection in software
engineering is described in Section 5, while, in the following section, we relate
the characteristic group structure of software networks to previously observed
dichotomous node degree mixing and degree-corrected clustering assortativity.


4.3. Group degree and clustering mixing 

Section 3 shows that software networks are characterized by dichotomous node
degree mixing that is assortative from the perspective of out-degrees, and
disassortative from the perspective of in-degrees. Moreover, networks are composed of
regions with rather similar clustering and reveal strong degree-corrected clustering
assortativity. We have postulated a hypothesis that the observed structure is
a consequence of different types of groups of nodes present in the networks. More
precisely, software networks contain dense community-like groups in regions with
higher clustering, which imply assortativity in the out-degree, and sparse
module-like groups in regions with lower clustering, which promote disassortativity by
in-degree, and different mixtures of these. As already discussed before, existence of
different groups immediately explains also degree-corrected clustering assortativity.

Fig. 10. Group degree profiles of larger networks that reveal no characteristic scaling.
(Panels: lucene, internet, collaboration; horizontal axis: node degree $k$; legend: degree $k$,
in-degree $k_{in}$, out-degree $k_{out}$.)

We pursue the hypothesis by first investigating the regions of the networks
occupied by different types of groups. Figure 10 shows group degree profiles that
plot mean group parameter $\langle \tau \rangle$ against node degree $k$. These do not
provide any clear insight into the structure of the networks, due to rather extensive
overlaps between the groups, i.e., both high and low degree nodes are included into
different groups. On the other hand, group degree-corrected clustering profiles in
Figure 11 clearly show that software network indeed consists of module-like groups with
$\tau \approx 0$ in sparse regions with low clustering $d \approx 0$ as hypothesized,
while the plot reveals an expected increasing trend. Similarly, the network contains
mostly community-like groups with $\tau \approx 1$ in dense regions with high clustering
$d \approx 1$; however, the corresponding nodes are included also in overlapping
module-like groups, thus $\langle \tau \rangle \approx 0.5$ (see Figure 11). The same
observations apply for social network and the Internet.

Fig. 11. Group (degree-corrected) clustering profiles of larger networks. Note that lucene software
network consists of module-like groups with $\tau \approx 0$ in regions with $d \approx 0$ as the
Internet and mostly community-like groups with $\tau \approx 1$ in regions with $d \approx 1$ as
the collaboration network. (Panels: lucene, internet, collaboration; horizontal axis: node
(degree-corrected) clustering; legend: clustering $c$, degree-corrected clustering $d$.)

We next consider group degree and clustering mixing. For this purpose, we define
group degree mixing coefficient $\hat{r}$, $\hat{r} \in [-1,1]$, as the correlation
between the group degree $k_S$ and the pattern degree $k_T$ taken over the extracted
groups $S$ and patterns $T$ (Eq. (8)), where $k_S$ is the degree of group $S$, and
similarly $k_T$ for the pattern degree. We further define also directed group degree
mixing coefficients $\hat{r}_{(\alpha,\beta)}$, $\alpha, \beta \in \{in, out\}$, and
group clustering mixing coefficients $\hat{r}_c$ and $\hat{r}_d$, symmetrically as
in Section 3. These provide an overview of degree and clustering mixing in regions
covered by groups of nodes, and enable reasoning about the network structure
implied by different types of groups.
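
A hedged sketch of such a coefficient is given below. Two assumptions are made purely for illustration and are not confirmed by the text: the coefficient is taken to be a Pearson correlation over the extracted pairs $(S, T)$, and the degree of a group is taken to be the mean degree of its nodes. All function names are hypothetical.

```python
from math import sqrt


def group_degree(adj, group):
    """Mean degree of the nodes in a group (an assumed definition)."""
    return sum(len(adj[v]) for v in group) / len(group)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (dev_x * dev_y)


def group_degree_mixing(adj, groups):
    """Correlate group degree with pattern degree over pairs (S, T)."""
    k_s = [group_degree(adj, S) for S, T in groups]
    k_t = [group_degree(adj, T) for S, T in groups]
    return pearson(k_s, k_t)
```

Under these assumptions, communities with $S = T$ contribute identical values to both sequences, so a network consisting of communities of different sizes and densities would push the coefficient towards one.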

Table 8 displays group mixing coefficients. Most evidently, almost all correlations
observed in the case of node mixing are strictly enhanced (see Section 3). Social
network is assortative by degree, while the Internet is degree disassortative. Software
networks again reveal disassortativity in the in-degrees. However, in contrast to
before, group structure in fact promotes assortativity by out-degree in all software
networks except colt network, due to the reason given in Section 4.2. Figure 12
shows also group pattern connectivity plots. For software network, one can clearly
observe an increasing trend in the case of out-degrees, and also larger in-degrees,
which is an artifact of community-like groups, as in the case of social network.
Otherwise, in-degree profile has a decreasing structure similar to that of the Internet,
which signifies module-like groups. Thus, confirming the above hypothesis, group
structure of software networks can indeed explain dichotomous degree mixing with
module-like groups responsible for disassortativity, most notably seen for smaller
in-degrees, and community-like groups promoting assortativity in the out-degrees.


Table 8. Group degree and clustering mixing coefficients of different networks.

Network        $\hat{r}$   $\hat{r}_{(in,in)}$  $\hat{r}_{(in,out)}$  $\hat{r}_{(out,in)}$  $\hat{r}_{(out,out)}$  $\hat{r}_c$  $\hat{r}_d$
jbullet        -0.02   -0.15      -0.01       -0.20       0.66         0.47    0.97
colt           -0.63   -0.60      -0.27       -0.63       -0.17        -0.59   0.76
jung           -0.32   -0.32      -0.12       -0.30       0.54         0.45    0.78
lucene         -0.16   -0.19      -0.12       -0.22       0.39         0.17    0.85
internet       -0.54   -          -           -           -            -0.37   0.37
collaboration  0.84    -          -           -           -            0.81    0.95


It ought to be mentioned that the above relation between degree mixing and
different groups of nodes can be justified theoretically. Since $S = T$ for communities,
this implies degree assortativity, as long as the sizes of communities differ [36]. Also,
for $s \neq t$, module-like groups should result in degree disassortativity [48]. Finally,
according to discussion in Section 4.2, modules or communities are best pronounced
through the out-degrees and in-degrees of nodes, respectively.
Fig. 12. Group pattern connectivity plots of larger networks (see also Table 8). Note that lucene
software network reveals assortative mixing by out-degree as social networks (e.g., collaboration
network) and disassortative mixing by in-degree as the Internet. While the former is an artifact
of community-like groups, the latter is in fact a signature of module-like groups.


Table 8 also reports group clustering mixing coefficients. As before, $\hat{r}_c < 0$ in
some degree disassortative networks, due to the biases introduced in clustering $c$
(see Section 3.3). Nevertheless, degree-corrected clustering mixing $\hat{r}_d$ signifies
extremely assortative structure with correlations between 0.75 and 0.95 for software
and social networks (see also Figure 13). Presence of clear groups of nodes thus
indeed implies degree-corrected clustering assortativity, while the value of $\hat{r}_d$
can be related to the quality of network group structure. For example, in the case of
the Internet, which has the least clear group structure (see Section 4.2), $\hat{r}_d$
is only 0.37.

In summary, characteristic groups of nodes provide an important insight into 
the dynamics of complex networks and can, at least to some extent, explain the 
unique structure of software networks (i.e., degree and clustering mixing). There is 
of course no reason why the same principles should not apply to other real-world 
networks, directed or undirected, which will be thoroughly explored in future work. 


Fig. 13. Group pattern (degree-corrected) clustering plots of larger networks (see also Table 8).
Note that networks reveal extremely clear group degree-corrected clustering assortativity (e.g.,
lucene and collaboration network), which is an indication of a well pronounced group structure.
(Panels: lucene, internet, collaboration; horizontal axis: group (degree-corrected) clustering;
legend: clustering $c$, degree-corrected clustering $d$.)
Fig. 14. Most significant groups of nodes extracted from different software networks (see also
Table 6): (a) community in jung network ($\tau = 1$) and (b) module-like group in colt network
($\tau = 0.06$). The groups correspond to (a) core classes of the software project and
(b) different implementations of classes with the same functionality. (Nodes with
degree-corrected clustering above or below the mean are shown as circles and triangles,
respectively.)


5. Applications in software engineering 

The present section describes preliminary work on practical applications of network
group detection in software engineering. As already discussed before, groups
of nodes in software dependency networks coincide with the intrinsic properties of
the underlying software systems. For instance, Figure 14 shows the most significant
groups revealed in jung and colt networks. In the case of the former, the best group
is a community that corresponds to core classes of the project, as predicted in
Section 4.2. Since the network represents a framework for graph and network analysis,
these are actually different graphs, multigraphs, hypergraphs and trees. Notice that
the revealed group is not only very clear, but also rather exhaustive.

On the other hand, the most significant group in colt network, which represents
a software library for high-performance scientific computing, is module-like and
contains different implementations of matrices (e.g., dense, sparse or wrapped).
Recall that the latter is consistent with the rationale behind the existence of modules
in software networks given in Section 4.2. Similarly as above, the group is indeed
transparent, while the identifiers of the corresponding software classes are extremely
consistent with each other (see Figure 14(b)). Thus, one can in fact derive templates
for class identifiers (e.g., by mining common textual patterns) and unique class
dependencies on the level of groups of nodes in a software network (i.e., by analyzing
corresponding node patterns). These can be adopted in future project development,
in order to maintain a high consistency of a software system, to reduce code
duplication and other issues. Furthermore, one can also predict the package of a class.

Classes of object-oriented software systems are organized into software packages
that form a complex hierarchy. Each class is a member of exactly one package,
whereas the classes can reside also in the inner nodes of the package hierarchy.
For example, the group of nodes shown in Figure 14(a) consists mostly of classes
in edu.uci.ics.jung.graph package, while the group in Figure 14(b) represents
classes in cern.colt.matrix.impl package. To predict the package of some class
given the group structure of the software network, we investigate the classes whose
nodes reside in the same network groups as the concerned one. These classes
are then weighted according to the Jaccard similarity [22] between the corresponding
nodes' neighborhoods and their packages are taken as the candidates for the
prediction. We select the most frequent package with respect to weights, while ties
are broken uniformly at random (see [46, 47] for details). Note that, instead of
considering nodes within the same network groups, one can of course examine merely
nodes' neighbors or the entire network. For comparison, we also report the
performance of a classifier that predicts the most frequent (i.e., majority) package
within the software system for each class and a random classifier. However, the
adoption of some more sophisticated approaches like deep belief nets [20] or structured
support vector machine [50] would inevitably require the identification of learning
features.
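
The group-based prediction strategy described above can be sketched as a weighted vote. The data structures (`adj` as a dictionary of neighbor sets, `groups` as a list of node sets, `package` as a dictionary from class to package name) and the choice to count each co-member once, even when it shares several groups with the target class, are assumptions of this illustration rather than details given in the text.

```python
import random
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two neighborhoods given as sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_package(node, adj, groups, package, rng=random):
    """Predict the package of `node` from classes sharing a group with it."""
    # Collect the classes residing in the same groups as the target class.
    co_members = set()
    for S in groups:
        if node in S:
            co_members |= set(S)
    co_members.discard(node)

    # Each co-member votes for its package, weighted by neighborhood overlap.
    votes = defaultdict(float)
    for other in co_members:
        if other in package:
            votes[package[other]] += jaccard(adj[node], adj[other])
    if not votes:
        return None                      # no labeled class shares a group

    # Pick the heaviest package; break ties uniformly at random.
    top = max(votes.values())
    return rng.choice(sorted(p for p, v in votes.items() if v == top))
```

Replacing `co_members` with the neighbors of the node, or with all nodes of the network, gives the two alternative strategies mentioned in the text.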

Table 9 shows classification accuracy for software package prediction. Observe
that the accuracy for the strategy based on network groups is around 75% in all cases
except for the larger lucene network. We stress that the latter is an impressive result.
Indeed, the task at hand represents an extremely difficult classification problem due
to a large number of possible classifications, whereas this number is usually two or
three in most practical applications (see performance of the baseline classifiers).
Notice also that the strategy based on nodes' neighbors performs very well in jbullet
and jung networks with more community-like groups (see $\langle \tau \rangle$ in
Table 7), since the groups well coincide with nodes' neighborhoods. On the other hand,
the neighbors are in fact different from one another in colt network with more
module-like groups (see Section 4.2), which significantly decreases the performance.


Table 9. Classification accuracy of software package prediction based on the node's
neighbors $\Gamma$ or groups $S$, or the entire network $N$ (see text for details).

Network   # Classes^a  # Packages  $\Gamma$   $S$     $N$     Majority  Random
jbullet      107          11       72.0%  75.7%  64.5%  28.0%     8.6%
colt         154          16       58.4%  73.4%  55.2%  22.7%     5.5%
jung         237          31       72.2%  74.2%  65.0%  11.4%     3.3%
lucene      1335         178       47.1%  49.2%  43.7%  6.4%      0.5%

Note: Results are averages over 100 runs
^a Analysis is reduced to nodes included in network groups


Table 10 shows also the accuracy for the high-level software package prediction
problem, where we consider only the packages at the topmost level of the package
hierarchy. For jung network, these are graph, algorithms, io, visualization and
visualization3d (prefix edu.uci.ics.jung is omitted). Again, the strategy based
on network groups performs particularly well with classification accuracy around
85-90%. Besides, the strategy based on nodes' neighbors, and also the network-based
strategy for jung network, obtains surprisingly high results, which further justifies
the construction of software dependency networks (see Section 2).

Table 10. Classification accuracy of high-level software package prediction based on the
node's neighbors $\Gamma$ or groups $S$, or the entire network $N$ (see text for details).
A question mark denotes a digit that is illegible in the source.

Network   # Classes^a  # Packages  $\Gamma$   $S$     $N$     Majority  Random
jbullet      107           5       84.6%  85.0%  78.5%  64.5%     20.4%
colt         154          10       86.4%  83.8%  69.5%  39.0%     9.7%
jung         237           5       89.1%  90.5%  91.1%  ??.3%     20.3%
lucene      1335          15       85.5%  90.8%  85.0%  28.2%     6.6%

Note: Results are averages over 100 runs
^a Analysis is reduced to nodes included in network groups

Thus, characteristic group structure of software networks can indeed be exploited
to quite accurately infer the package hierarchy of software systems [46, 47]. This
has numerous applications. For instance, the framework can be used to predict
packages of new classes introduced into an unknown software project or even the
programming language itself, to detect possibly duplicated classes, or for merging
classes across different software packages or libraries (one by one). Such tasks would
otherwise demand significant manual labor, especially for large and complex software
systems. Furthermore, network group detection can be adopted for software project
refactoring, in order to derive either a more modular or a more functional software
package hierarchy (i.e., community-like and module-like, respectively).

As shown below, characteristic groups in software networks can also be used to
infer the name of the developer that implemented a particular class, the exact
version at which it was introduced into the project or its type (i.e., class or interface).
However, as this information was largely unavailable or could not be obtained
automatically for the software projects considered, we only report the results for jung
network. The prediction otherwise proceeds exactly the same as before, while the classes
with unknown version or author information are grouped into a single category.

Table 11 shows the classification accuracy for different software prediction
problems. For class type prediction, the strategy based on network groups performs only
slightly better than the baseline approach that classifies all software classes into the
same category. On the other hand, the performance is significantly improved in
the case of class version and author prediction problems with accuracy over 70%.
This is not very surprising, since classes with the same functionality that appear
as different groups in software networks are commonly introduced within the same
version of the software project and implemented by the same developer.

Table 11. Classification accuracy of class prediction for jung software network based
on the node's neighbors $\Gamma$ or groups $S$, or the entire network $N$ (see text for details).

Prediction     # Categories  $\Gamma$   $S$     $N$     Majority  Random
Class type          2        65.0%  85.2%  84.8%  84.4%     49.9%
Class version       9        67.7%  72.8%  66.2%  44.3%     11.2%
Class author       11        71.6%  71.0%  70.9%  44.3%     9.2%

Note: Results are averages over 100 runs

Furthermore, according to Section 4.2, the quality of network group structure
reflects different programming principles and paradigms. Since this can be measured
by degree-corrected group clustering mixing (see Section 4.3), the latter enables
different applications in software development and quality control.


6. Conclusions and future work 

The present paper rigorously analyzes the structure of complex software networks. 
These can be seen as an interplay between a dense structure of social networks and 
a sparse topology of the Internet. In particular, we show that software networks 
reveal characteristic node group structure, which consists of dense communities, 
sparse module-like groups and also different mixtures of these. Communities imply 
assortative mixing by degree, whereas just the opposite holds for the modules. 
Thus, software networks reveal dichotomous degree mixing that is assortative in the 
out-degrees and disassortative in the in-degrees. Furthermore, communities appear 
in denser regions with higher clustering, while most pronounced modules occupy 
sparse regions with very low clustering. The latter in fact promotes degree-corrected 
clustering assortativity, which is observed in all of the networks analyzed.

Besides, the group structure of software networks also coincides with the intrinsic
properties of the underlying software systems. The paper thus includes some
preliminary work on practical applications of network group detection in software
engineering. Nevertheless, their true practical value in real scenarios remains
somewhat unclear and will be more thoroughly investigated in the future.

The study of differences between software and social networks, and the Internet, 
reveals notably distinct network topologies that are most likely governed by different 
phenomena. We stress that dichotomous node degree mixing has not yet been
observed in the case of directed networks. Furthermore, preliminary results show that
the existing graph models do not produce degree-corrected clustering assortativity 
of real-world networks. The latter will be the main focus of our future work. 

Additionally, the paper implies several other prominent directions for future
research. First, the observed node mixing and group structure might also apply to
different software and other real-world networks. Among these, various information
networks seem most promising. Next, characteristic group structure revealed for
software networks might be further related to other properties, e.g., self-similarity [4]
or hierarchical structure [55]. Last, although we provide some rationale for the
presence of groups in software networks, a generative graph model is still an open issue.